In this project, we predict concrete strength from eight predictor variables. We apply feature engineering, hyperparameter tuning, and cross-validation with the aim of boosting our models' performance.
import numpy as np
%matplotlib inline
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
from scipy.stats import pearsonr
from matplotlib import pyplot as plt
df = pd.read_csv('concrete.csv')
df.head()
#The variables slag, ash, and superplastic all range from 0 up to some maximum value
#The minimum and maximum ages are 1 and 365 days, respectively; the mean age is approximately 45 days
df.describe()
#There are no null values in this dataset
df.isnull().any()
#All data types are float except for age. 'strength' is the variable we wish to predict
#There are 1030 observations and 9 columns. This seems to be a quite small dataset
''' The first seven variables in this list are measured in kg per m^3 of mixture. Age is measured in days
(from 1 to 365). Concrete strength is measured in MPa'''
df.info()
#The variables cement and strength are very roughly normally distributed
#The variables slag, ash, water, superplastic, and coarseagg appear to have roughly bimodal distributions
import warnings
warnings.filterwarnings('ignore')
for i in df.columns:
    sns.histplot(df[i], kde=True)  # distplot is deprecated in recent seaborn versions
    plt.show()
Since the dataset has only 1,030 observations, I will not be stringent about removing outliers, in order to prevent overfitting. The superplastic and age variables are both strongly right-skewed. However, I will remove the few observations whose age lies more than two standard deviations above the mean, i.e., past 171 days.
#We note that the highest number of outliers seems to be present with age
#We choose to remove outliers for 'age' and keep all other observations due to the small size of the dataset
for i in df.columns:
    sns.boxplot(x=df[i])
    plt.show()
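The age-outlier rule described above (drop observations more than two standard deviations above the mean age) is stated but never coded in the notebook. A minimal sketch of that filter, on hypothetical stand-in data rather than the real `df`:

```python
# Hypothetical stand-in data; the real notebook would apply this to df['age']
import pandas as pd

df_demo = pd.DataFrame({'age': [1, 3, 7, 7, 14, 28, 28, 28, 56, 90, 365]})

# Keep only rows whose age is within two standard deviations above the mean age
cutoff = df_demo['age'].mean() + 2 * df_demo['age'].std()
df_trimmed = df_demo[df_demo['age'] <= cutoff]
```

On the real data the cutoff lands near 171 days; here the single extreme age (365) is dropped.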
#None of the predictor variables appears to have a strong correlation with the 'strength' variable
#There may be a slight positive correlation between superplastic & strength, and between superplastic & water
sns.pairplot(df, diag_kind='kde')
f, ax = plt.subplots(figsize=(11, 9))
corr = df.corr()
heatmap = sns.heatmap(corr, cmap="YlGnBu", annot= True)
From this heatmap, I note that there is little to no correlation among the predictor variables in this dataset. Cement, superplastic, and age are the variables most correlated with strength, so I choose to keep these variables in the analysis. There is no need to remove any variables, since we are working with a small number of features that do not appear to covary. Before continuing with bivariate analysis, I will do some feature engineering, creating categorical variables from age, water, and coarseagg to see whether I can increase their correlation with the target variable.
#Four categories are created for each of these three variables and inserted back into the dataframe
#The interval boundaries are based on the quartiles found at the beginning of the analysis
age_cat = pd.cut(df.age,bins=[-1,8,29,57,366],labels=['Recent','Average','Older','Very Old'])
water_cat = pd.cut(df.water,bins=[121.6,164.8,186,193,248],labels=['Low','Average','High','Very High'])
coarseagg_cat = pd.cut(df.coarseagg,bins=[800,933,969,1029.5,1146],labels=['Low','Average','High','Very High'])
df.insert(8, 'age_cat', age_cat)
df.insert(4, 'water_cat', water_cat)
df.insert(6, 'coarseagg_cat', coarseagg_cat)
df.head()
#This is interesting. Although younger concrete tends to have less strength,
#older concrete also tends to be stronger than very old concrete
#The amount of water does not seem to have a big impact on concrete strength,
#since the densities for the groups overlap; low water may correlate with strength, however
#Coarseagg levels do not seem to affect concrete strength either
f, ax = plt.subplots(figsize=(15, 8), ncols=3)
sns.kdeplot(data=df, x="strength", hue="age_cat",
            fill=True, common_norm=False,
            alpha=.5, linewidth=0, ax=ax[0])
sns.kdeplot(data=df, x="strength", hue="water_cat",
            fill=True, common_norm=False,
            alpha=.5, linewidth=0, ax=ax[1])
sns.kdeplot(data=df, x="strength", hue="coarseagg_cat",
            fill=True, common_norm=False,
            alpha=.5, linewidth=0, ax=ax[2])
#We explore the categorical age variable further here using jointplots
#It seems that for all age groups, the slag variable has a peak at 0 with varying cement values
#I note that the average age group tends to have the most variability across numerical variables
#However, judging from the KDEs, all age groups tend to follow similar patterns for each numerical variable
sns.jointplot(data=df, x="cement", y="slag", hue="age_cat")
sns.jointplot(data=df, x="superplastic", y="coarseagg", hue="age_cat")
sns.jointplot(data=df, x="superplastic", y="ash", hue="age_cat")
sns.jointplot(data=df, x="water", y="slag", hue="age_cat")
sns.jointplot(data=df, x="ash", y="fineagg", hue="age_cat")
sns.jointplot(data=df, x="superplastic", y="cement", hue="age_cat")
sns.jointplot(data=df, x="superplastic", y="water", hue="age_cat")
sns.jointplot(data=df, x="water", y="fineagg", hue="age_cat")
sns.jointplot(data=df, x="ash", y="cement", hue="age_cat")
#Overall, there are a few outliers within the age groups, but I will keep them in the analysis,
#because with so few observations the model may overfit if there is too little variability
#From the boxplots, I notice significant amounts of skew. Scaling may be necessary
f, ax = plt.subplots(figsize=(20, 15), ncols=2, nrows=4)
sns.boxplot(ax=ax[0, 0], data=df, x='age_cat', y='cement')
sns.boxplot(ax=ax[0, 1], data=df, x='age_cat', y='slag')
sns.boxplot(ax=ax[1, 0], data=df, x='age_cat', y='ash')
sns.boxplot(ax=ax[1, 1], data=df, x='age_cat', y='water')
sns.boxplot(ax=ax[2, 0], data=df, x='age_cat', y='superplastic')
sns.boxplot(ax=ax[2, 1], data=df, x='age_cat', y='coarseagg')
sns.boxplot(ax=ax[3, 0], data=df, x='age_cat', y='fineagg')
sns.boxplot(ax=ax[3, 1], data=df, x='age_cat', y='strength')
#Since fineagg and coarseagg do not have high correlations with concrete strength, I will combine them
#in an attempt to increase the correlation. The new variable is the total aggregate in the cement mixture
df['totalagg'] = df['fineagg'] + df['coarseagg']
cols = ['cement', 'slag', 'ash', 'water', 'water_cat', 'superplastic', 'coarseagg_cat', 'coarseagg',
'fineagg', 'totalagg', 'age', 'age_cat' ,'strength']
df = df[cols]
df.head()
#Finally, I get an overall snapshot of how the numerical data looks with an age hue applied to it
#Across all variables, younger concrete tends to have less strength overall
#This will be a good categorical variable to use in the model
sns.pairplot(df, diag_kind='kde', hue = 'age_cat')
#I note that the created totalagg variable increased the magnitude of correlation with the strength variable
#I choose to keep this as a predictor variable when building the model
df.corr()[['coarseagg', 'fineagg', 'totalagg']]
#Transforming the categorical data into numerical data
#I simply replace the categorical values with numbers
#sklearn's LabelEncoder is not used, since it assigns codes alphabetically rather than by category order
#Because the categories are ordinal (e.g., very high > high), I assign the numbers manually
mapping1 = {'Low': 1, 'Average': 2, 'High': 3, 'Very High':4}
df = df.replace({'water_cat': mapping1, 'coarseagg_cat': mapping1})
mapping2 = {'Recent': 1, 'Average': 2, 'Older': 3, 'Very Old':4}
df = df.replace({'age_cat': mapping2})
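The manual mapping above works; an equivalent approach (a sketch, assuming scikit-learn's `OrdinalEncoder` and illustrative data, not the notebook's `df`) passes the category order explicitly so the codes respect Low < Average < High < Very High:

```python
# Sketch: OrdinalEncoder with an explicit `categories` list preserves the
# intended order; note it yields codes 0-3 rather than the notebook's 1-4
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ['Low', 'Average', 'High', 'Very High']
enc = OrdinalEncoder(categories=[order, order])

demo = pd.DataFrame({'water_cat': ['Low', 'High', 'Very High'],
                     'coarseagg_cat': ['Average', 'Low', 'High']})
demo[['water_cat', 'coarseagg_cat']] = enc.fit_transform(
    demo[['water_cat', 'coarseagg_cat']])
```

This avoids repeating the mapping dictionary per column and fails loudly on unseen category labels.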
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
'fineagg', 'totalagg', 'age', 'strength']] = scaler.fit_transform(df[['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
'fineagg', 'totalagg', 'age', 'strength']])
from sklearn.model_selection import train_test_split
#Making the training set, validation set, and testing set 60%, 20%, and 20%, respectively
X_train, X_test, y_train, y_test = train_test_split(df.drop('strength',axis=1), df['strength'], test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)
#Note that the target variable is continuous, so regression models are required
#I will first run all the models untuned, with default parameters
#(the 'accuracy' columns below are R^2 values from .score(); the 'test' scores are computed on the validation set)
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=0)
dt_model.fit(X_train, y_train)
train_score = dt_model.score(X_train, y_train)
test_score = dt_model.score(X_val, y_val)
predictions = dt_model.predict(X_val)
train_pred = dt_model.predict(X_train)
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
results = pd.DataFrame({'Method':['Decision Tree Model'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
from sklearn.ensemble import GradientBoostingRegressor
gboost_model = GradientBoostingRegressor(random_state=0)
gboost_model.fit(X_train, y_train)
train_score = gboost_model.score(X_train, y_train)
test_score = gboost_model.score(X_val, y_val)
predictions = gboost_model.predict(X_val)
train_pred = gboost_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Gradient Boosting Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
from sklearn.ensemble import AdaBoostRegressor
aboost_model = AdaBoostRegressor(random_state=0)
aboost_model.fit(X_train, y_train)
train_score = aboost_model.score(X_train, y_train)
test_score = aboost_model.score(X_val, y_val)
predictions = aboost_model.predict(X_val)
train_pred = aboost_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['AdaBoost Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
from sklearn.ensemble import BaggingRegressor
bag_model = BaggingRegressor(random_state=0)
bag_model.fit(X_train, y_train)
train_score = bag_model.score(X_train, y_train)
test_score = bag_model.score(X_val, y_val)
predictions = bag_model.predict(X_val)
train_pred = bag_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Bagging Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(X_train, y_train)
train_score = rf_model.score(X_train, y_train)
test_score = rf_model.score(X_val, y_val)
predictions = rf_model.predict(X_val)
train_pred = rf_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Random Forest Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results_untuned = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results_untuned
Insights: The random forest regressor has the best performance and the lowest RMSE, although it is overfitting. The decision tree model has the lowest mean absolute error; its train accuracy is very high, but its test accuracy falls short of the gradient boosting and random forest regressors, so the decision tree regressor is overfitting to the training set as well. The AdaBoost regressor performed the worst and has the highest errors. All of the other models are quite prone to overfitting, however. Now I will tune these models to reduce overfitting and improve their performance.
Decision Tree Regressor
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
clf = DecisionTreeRegressor(random_state = 0)
param_dist = {"max_depth": sp_randint(1, 101),
              "max_features": sp_randint(1, X_train.shape[1] + 1),  # cannot exceed the number of features
              "min_samples_split": sp_randint(2, 51),
              "min_samples_leaf": sp_randint(1, 51),
              "criterion": ["squared_error", "friedman_mse", "absolute_error"]}  # regression criteria, not gini/entropy ("mse"/"mae" on scikit-learn < 1.0)
samples = 200 # number of random samples
dt_randomCV = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = samples, random_state = 0)
dt_randomCV.fit(X_val, y_val)
print(dt_randomCV.best_params_)
dt_randomCV.best_estimator_
dt_model = dt_randomCV.best_estimator_
dt_model.fit(X_train, y_train)
train_score = dt_model.score(X_train, y_train)
test_score = dt_model.score(X_val, y_val)
predictions = dt_model.predict(X_val)
train_pred = dt_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
results = pd.DataFrame({'Method':['Decision Tree Model'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
Gradient Boosting Regressor
clf = GradientBoostingRegressor(random_state = 0)
param_dist = {"max_depth": sp_randint(1, 101),
              "max_features": sp_randint(1, X_train.shape[1] + 1),  # cannot exceed the number of features
              "min_samples_split": sp_randint(2, 51),
              "min_samples_leaf": sp_randint(1, 51),
              "criterion": ["friedman_mse", "squared_error"]}  # regression criteria; gini/entropy are for classifiers
samples = 100 # number of random samples
gb_randomCV = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = samples, random_state = 0)
gb_randomCV.fit(X_val, y_val)
print(gb_randomCV.best_params_)
gboost_model = gb_randomCV.best_estimator_
gboost_model.fit(X_train, y_train)
train_score = gboost_model.score(X_train, y_train)
test_score = gboost_model.score(X_val, y_val)
predictions = gboost_model.predict(X_val)
train_pred = gboost_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Gradient Boosting Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
Adaboost Regressor
clf = AdaBoostRegressor(random_state = 0)
from scipy.stats import uniform
param_dist = {"n_estimators": sp_randint(1, 101),
              "learning_rate": uniform(0.01, 1.0),  # continuous learning rates rather than integers
              "loss": ["linear", "square", "exponential"]}
samples = 100 # number of random samples
ab_randomCV = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = samples, random_state = 0)
ab_randomCV.fit(X_val, y_val)
print(ab_randomCV.best_params_)
aboost_model = ab_randomCV.best_estimator_
aboost_model.fit(X_train, y_train)
train_score = aboost_model.score(X_train, y_train)
test_score = aboost_model.score(X_val, y_val)
predictions = aboost_model.predict(X_val)
train_pred = aboost_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['AdaBoost Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
Bagging Regressor
clf = BaggingRegressor(random_state = 0)
param_dist = {"n_estimators": sp_randint(1, 101),
"max_samples": sp_randint(1, 51),
"bootstrap": [True, False],
"warm_start": [True, False]}
samples = 100 # number of random samples
bag_randomCV = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = samples, random_state = 0)
bag_randomCV.fit(X_val, y_val)
print(bag_randomCV.best_params_)
bag_model = bag_randomCV.best_estimator_
bag_model.fit(X_train, y_train)
train_score = bag_model.score(X_train, y_train)
test_score = bag_model.score(X_val, y_val)
predictions = bag_model.predict(X_val)
train_pred = bag_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Bagging Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results
Random Forest Regressor
clf = RandomForestRegressor(random_state = 0)
param_dist = {"max_depth": [None, 3, 5, 10, 15],
              "max_features": sp_randint(1, X_train.shape[1] + 1),  # cannot exceed the number of features
              "min_samples_split": sp_randint(2, 51),
              "min_samples_leaf": sp_randint(1, 51),
              "criterion": ["squared_error", "absolute_error"]}  # regression criteria ("mse"/"mae" on older scikit-learn)
samples = 100 # number of random samples
rf_randomCV = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = samples, random_state = 0)
rf_randomCV.fit(X_val, y_val)
print(rf_randomCV.best_params_)
rf_model = rf_randomCV.best_estimator_
rf_model.fit(X_train, y_train)
train_score = rf_model.score(X_train, y_train)
test_score = rf_model.score(X_val, y_val)
predictions = rf_model.predict(X_val)
train_pred = rf_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
mae = mean_absolute_error(train_pred, y_train)
temp = pd.DataFrame({'Method':['Random Forest Regressor'], 'Train Accuracy': [train_score],
'Test Accuracy': [test_score], 'RMSE': [rmse], 'Mean Absolute Error': [mae]})
results = pd.concat([results, temp])
results_tuned = results[['Method', 'Train Accuracy', 'Test Accuracy', 'RMSE', 'Mean Absolute Error']]
results_untuned
results_tuned
Insights: In the untuned batch of models it is very clear that overfitting is occurring: the training accuracies are extremely high, which bodes poorly for data the models have never seen. After tuning, the training accuracies fell to a more appropriate level, reducing the extent of overfitting. The gradient boosting and AdaBoost regressors also achieved higher accuracy on the held-out set after tuning, with reduced RMSE values, although overfitting remains a problem for the gradient boosting regressor even after tuning. While some test accuracies in the tuned models went down, overfitting was reduced by a large margin, which should keep performance high on unseen data. Thus far, we have used the validation set as our testing set; the models have not yet seen the test data. I therefore finally run K-fold cross-validation to observe how the tuned models perform on new data.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')
kfold = KFold(n_splits= 4, shuffle=True, random_state= 0)  # random_state has no effect without shuffle=True
model = dt_randomCV.best_estimator_
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("Decision Tree Regressor Accuracy: %.3f%%, SD: (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfold = KFold(n_splits= 4, shuffle=True, random_state= 0)
model = gb_randomCV.best_estimator_
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("Gradient Boosting Regressor Accuracy: %.3f%%, SD: (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfold = KFold(n_splits= 4, shuffle=True, random_state= 0)
model = ab_randomCV.best_estimator_
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("AdaBoost Regressor Accuracy: %.3f%%, SD: (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfold = KFold(n_splits= 4, shuffle=True, random_state= 0)
model = bag_randomCV.best_estimator_
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("Bagging Regressor Accuracy: %.3f%%, SD: (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
kfold = KFold(n_splits= 4, shuffle=True, random_state= 0)
model = rf_randomCV.best_estimator_
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("Random Forest Regressor Accuracy: %.3f%%, SD: (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Insights: As one can see, the gradient boosting regressor performs the best of all the tested models, so it would be the ideal model for predicting concrete strength. It tends to overfit, yet performs well on unseen data despite this. On the other hand, the decision tree regressor was by far the worst model, due to its inherent tendency to overfit; if this model were to be implemented, proper pruning would be needed to prevent overfitting. The bagging regressor is also a strong candidate for this supervised learning problem.
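The pruning suggested for the decision tree can be done with scikit-learn's cost-complexity pruning via `ccp_alpha`. A minimal sketch on synthetic data (not the notebook's split; choosing alpha on the held-out split here is purely for illustration):

```python
# Sketch: prune a DecisionTreeRegressor by picking the ccp_alpha with the
# best held-out R^2 from the tree's cost-complexity pruning path
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Candidate alphas come from the pruning path (alpha=0 is the unpruned tree)
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
scores = [DecisionTreeRegressor(random_state=0, ccp_alpha=a)
          .fit(X_tr, y_tr).score(X_te, y_te) for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
```

The pruned tree has no more leaves than the full tree, and in practice its held-out score is at least as good because alpha=0 is among the candidates.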